\section{Simulation corrections}

Intrinsic physical reasons, such as non-converging QCD calculations and free parameters in the SM, as well as limited computing resources, lead to differences between simulated events and the events observed in the experiment. One problem related to this mismatch is a possible bias when attempting to increase the signal-to-noise ratio with an MVA background rejection. Therefore, the difference is estimated and reduced by assigning event weights to the simulated events in order to improve the agreement between the two distributions.

\subsection{Reweighting techniques}

The general concept of the reweighting examined in this thesis is as follows:
\begin{enumerate}
	\item Compare the signal distributions of the normalisation-channel data in specific variables with the corresponding MC distributions.
	\item Understand their differences and learn which events are likely to occur more often in the data sample.
	\item Correct the generated signal events by applying a weight to each event in order to compensate for the differences learnt. Events occurring more often in the real sample than in the generated sample receive higher weights and vice versa.
\end{enumerate}

To be able to compare the generated signal events and the data signal events, the \sPlot technique is used, which statistically subtracts the combinatorial background from the sample~\cite{Pivk:2004ty}. In the process, per-event weights, the sWeights, are calculated, which requires performing a fit to the \Bu mass in the data sample as shown in Fig.~\ref{fig:data:massfit_for_sweights}. Throughout this section, data refers to signal sWeighted data, which is treated as the equivalent of the generated signal events.

\begin{figure}
	\centering
	\includegraphics[width=0.5\linewidth]{figs/sweights/K1JPsi_mm_sWeights.pdf}
	\caption{Fit to the \B mass of \Btojpsikpipimumu to obtain the sWeights.}
	\label{fig:data:massfit_for_sweights}
\end{figure}

To learn from the differences, generalise this knowledge and correct the target distribution, several different techniques are available.

\subsubsection{Binned reweighting}

A simple but widely used approach is to bin the two samples, data and MC, in the variable that needs to be corrected. The content of every bin of one sample is then divided by the corresponding bin of the other sample, which results in a step function containing the bin-wise ratios.

This approach is easy and fast, but it has its limitations and disadvantages. If a bin contains only a few events, the ratio fluctuates strongly and does not provide reliable weights. This is especially a problem in higher dimensions. As the simple approach reweights only a single variable, it is often not sufficient: variables are one-dimensional projections of a multi-dimensional distribution, so the reweighting does not properly account for higher-order correlations. Binning can, however, also be performed in several dimensions, but without a significant amount of data the curse of dimensionality creates sparse bins with only a few events each, leading to the fluctuations mentioned above.
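In one dimension, the binned approach amounts to a histogram-ratio look-up; a minimal NumPy sketch (function and variable names are illustrative, not part of the analysis code) could read:

```python
import numpy as np

def binned_weights(mc, data, bins=20):
    """Per-event MC weights from the binned data/MC ratio in one variable.
    Empty MC bins get weight 1 to avoid division by zero."""
    edges = np.histogram_bin_edges(np.concatenate([mc, data]), bins=bins)
    mc_h, _ = np.histogram(mc, bins=edges)
    da_h, _ = np.histogram(data, bins=edges)
    # normalise both samples so the weights reshape, not rescale, the MC
    ratio = np.divide(da_h / da_h.sum(), mc_h / mc_h.sum(),
                      out=np.ones(bins, dtype=float), where=mc_h > 0)
    # look up, for each MC event, the ratio of the bin it falls into
    idx = np.clip(np.digitize(mc, edges) - 1, 0, bins - 1)
    return ratio[idx]
```

The sparse-bin fluctuations described above appear directly here: a bin with few MC events produces a ratio with a huge statistical uncertainty that is then applied verbatim as a weight.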

\subsubsection{Gradient boosted reweighting}

An algorithm that tries to overcome those limitations is gradient boosted reweighting~\cite{Rogozhnikov:boostedreweighting}. The main characteristic of this approach is to split both samples using a decision tree (DT). The optimal split is determined by maximising a binned \chisq statistic between the two samples. The ratio between the number of events of both samples in each bin is calculated and applied as a correction to the MC. The same procedure is repeated iteratively, taking the weights from the previous splits into account. From this procedure -- discriminate the samples, update the weights, repeat -- comes the ``boosting'' in the name.
Although the DT allows for good corrections in higher-dimensional spaces and in regions with few events, the algorithm is sensitive to its hyper-parameters and can easily overfit. When using this approach, it is crucial to ensure that no overfitting occurs.

\subsection{Performance}
\label{sec:reweightperformance}

To find the optimal reweighting hyper-parameters and to be able to compare the different approaches, a metric for the reweighting quality has to be established. Unfortunately, the comparison of two multi-dimensional distributions\footnote{A naive approach would be to compare the one-dimensional projections, which correspond to the physical variables. But whereas two different projections imply different distributions, two similar projections do not imply similar distributions. An illustrative explanation is shown in the Appendix in Fig.~\ref{fig:appendix:reweighting:ndim_dist_projections}.} is not straightforward. In contrast to one dimension, no ordering of events is defined in multi-dimensional distributions; such an ordering is used in non-parametric tests like Kolmogorov--Smirnov, Anderson--Darling \etc Although certain approaches such as density kernels exist for multi-dimensional distributions, they are infeasible for our case due to the lack of events and/or the high dimensionality. On the other hand, the question that arises is not whether a certain statistical test can distinguish our samples, but whether the MVA algorithm that will be used for the background rejection afterwards can. Therefore it seems natural to rely on its predictions to construct a reliable metric.
In the following, three different approaches to investigate this problem are described.

\subsubsection{Simple discrimination}
\label{sec:simplediscrimination}

A classifier\footnote{Classifier refers to an MVA algorithm that predicts the class label (as opposed to regression).} is trained and tested on the reweighted MC sample and on the real data using stratified k-folding and the variables that will be used later on in the MVA, shown in Table~\ref{tab:xgbvariables}. To quantify the performance of the classifier, a single-valued metric is needed. Therefore, the receiver operating characteristic (ROC) curve is drawn and the area under the curve (AUC) is calculated~\cite{ML:ROC_AUC:Bradley:1997:UAU:1746432.1746434}. Note that the same metric will be used later in the MVA. The idea here is that the lower this score, the less the classifier is able to discriminate the two distributions. Less discrimination power means that the two distributions are more similar, under the assumption that an optimised MVA algorithm is used.
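The AUC itself can be computed directly from weighted classifier scores; the following NumPy sketch (an illustration, not the analysis implementation; ties are counted one-sidedly, which is adequate for continuous scores) shows the rank-based computation and why 0.5 corresponds to indistinguishable samples:

```python
import numpy as np

def weighted_roc_auc(scores, labels, weights):
    """ROC AUC as the weighted probability that a class-1 event scores
    higher than a class-0 event. 0.5 means no discrimination, i.e. the
    two distributions look alike to the classifier."""
    order = np.argsort(scores)
    y = np.asarray(labels)[order].astype(bool)
    w = np.asarray(weights)[order]
    # negative-class weight accumulated below each event (ascending scores)
    cum_neg = np.cumsum(np.where(~y, w, 0.0))
    # each positive event contributes (its weight) x (neg. weight below it)
    return float(np.sum((w * cum_neg)[y]) / (w[y].sum() * w[~y].sum()))
```

With identically distributed scores for both classes, the statistic fluctuates around 0.5; any systematic difference the classifier picks up pushes it upwards.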

Even though this approach yields a good idea of the similarity of the two samples, it can be blind to some kinds of overfitting. The problem arises from the event weights in combination with the randomised drawing of the training and test samples. If an event $a$ with an event weight $w_a$ is drawn, what is actually drawn is not one event but $w_a$ times the event $a$. This is no longer a randomised, uncorrelated sampling, as drawing the event $a$ implies drawing the event $a$ again, namely $w_a - 1$ times (for $w_a>1$). So the prerequisite for making statements, the randomised splitting, is no longer given. The effect of this sample biasing is that the classifier makes incorrect predictions with wrongly gained strong confidence, which in turn lowers the ROC AUC more than expected. This effect is further illustrated in the Appendix in Fig.~\ref{fig:appendix:reweighting:roc_auc_bias}. Although the effect decreases for large samples and is not expected to be too large in our case, it \textit{can} even lead to ROC AUC values well below the 0.5 mark, which is usually assumed to be the lowest possible score.

\subsubsection{Label the data}

Another possible approach is to train a classifier on the original, uncorrected MC sample as well as on the real data. This trained algorithm can then make predictions on three distinct samples and can be used to get hints of possible overfitting. For each of the following samples, the number of events predicted as real data is counted and interpreted:

\begin{itemize}
	\item MC: The lowest count is expected as most of the events will be predicted as MC.
	\item reweighted MC: A count as high as possible (but not higher than that of the real data) is aimed for, as higher values mean that more events in the sample look like a real event to the classifier.
	\item real data: The highest count is expected.
\end{itemize}

Ideally, the count of the reweighted MC sample lies between the other two counts, as close as possible to the real data. However, this scoring system should be used only as a rough guideline, since it only provides information about single events and not about the distribution itself. A real-like MC event with an exceptionally large weight, for example, will wrongly dominate the score.
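As an illustration, the counting score can be sketched as follows, where the classifier probabilities are toy stand-ins for the output of the actual trained algorithm:

```python
import numpy as np

def predicted_real_count(proba_real, weights=None):
    """Count the (optionally weighted) events a trained MC-vs-data
    classifier labels as real data (probability above 0.5)."""
    p = np.asarray(proba_real)
    w = np.ones(len(p)) if weights is None else np.asarray(weights)
    return float(np.sum(w[p > 0.5]))
```

Applied to the three samples, the ideal ordering described above reads `predicted_real_count(mc) <= predicted_real_count(reweighted_mc) <= predicted_real_count(real_data)`, with the middle count as close to the last as possible.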

\subsubsection{Count the data}

In order to compare the distributions and not single events, a simple but robust approach is to train a classifier to discriminate between generated and real data, essentially as described in Sec.~\ref{sec:simplediscrimination}, but instead of making predictions on both the generated and the real sample, only the latter is considered. Then the number of real events predicted as real is counted. The more the classifier was able to learn from the distributions, the more real events will be predicted correctly\footnote{It has to be noted here that predictions are just a cut on the classifier output. To use the output for our purpose, equalised class weights are required. Furthermore, the classifier itself has to generate probability-like predictions, which XGBoost does. This is not per se the case for most algorithms.}. So the goal is to minimise this score. Compared to the approach described in Sec.~\ref{sec:simplediscrimination}, the bias due to the weights is constant and originates only from the weights of the real sample. Therefore, changing the weights of the generated sample, as a reweighting algorithm does, does not change the bias. Although this score does not provide information at the per-cent level of optimisation, it is a good indicator of overfitting and complementary to the other scores.
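A minimal sketch of the class-weight equalisation mentioned in the footnote and of the resulting count score (names are illustrative, not taken from the analysis code):

```python
import numpy as np

def equalise_class_weights(w_mc, w_real):
    """Rescale the MC weights so both classes carry the same total weight,
    which makes 0.5 a meaningful decision boundary on probability-like
    classifier output."""
    w_mc = np.asarray(w_mc, dtype=float)
    w_real = np.asarray(w_real, dtype=float)
    return w_mc * (w_real.sum() / w_mc.sum()), w_real

def count_score(proba_real, w_real):
    """Weighted number of real events predicted as real; the reweighting
    aims to minimise this."""
    return float(np.sum(np.asarray(w_real)[np.asarray(proba_real) > 0.5]))
```

Since only the real-sample weights enter `count_score`, changing the MC weights during reweighting leaves the weight-induced bias of this score untouched, which is the property exploited above.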

\subsection{Corrections}

To find the optimal reweighting algorithm, the scores described in Sec.~\ref{sec:reweightperformance} as well as visual comparisons of the variable distributions are used to estimate a good configuration. The values obtained for the different parameters are shown in Table~\ref{tab:gbreweightconfig}.

\begin{table}[tb]
	\caption{
		%\small %captions should be a little bit smaller than main text
		Hyper-parameter configuration for the gradient boosted reweighting. Where two values are given, the first was used for the first reweighting stage and the second for the second stage. A single value means the same value was used in both stages.}
	\begin{center}\begin{tabularx}{\textwidth}{lcX}
			\hline
			Parameter 		& Value 	& Explanation\\
			\hline
			n\_estimators	& 240/140	& Number of boosting rounds to be performed (see comment \textit{learning\_rate})\\
			
			\hline
			learning\_rate	& 0.05 		& A factor by which the weight corrections of each boosting stage are multiplied. There is a trade-off between learning\_rate and n\_estimators, and their ratio (basically) determines how complex the model is.\\
			
			\hline
			max\_depth		& 3			& Maximum depth of the DT. Higher values create more complex models and are able to get higher order correlations but tend to over-fit.\\
			
			\hline
			min\_samples\_leaf	& 100	& The minimum number of events required in a leaf. Larger values create more conservative models and can help to avoid overfitting. \\
			
			\hline
			loss\_regularization & 8/10 & Adds a regularisation term to the weight inside the logarithm of the loss-function. \\
			
			\hline
			gb\_args: subsample	& 0.8 	& The fraction of the data that is used to train each DT. Reduces overfitting. \\				
			
			
		\end{tabularx}\end{center}
		\label{tab:gbreweightconfig}
	\end{table}

Two stages of corrections are applied in order to obtain the best results without biasing the data. The first stage considers the \Btojpsikpipimumu decay and is responsible for the largest corrections. All variables are listed in Table~\ref{tab:firstreweight}. In particular, $nTracks$ and $nSPDHits$ differ considerably between MC and data.
The second stage of corrections uses the \Btojpsikpipiee sample and is less significant. It corrects the kinematics of the decay products. The variables are listed in Table~\ref{tab:secondreweight}.
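Assuming the \texttt{GBReweighter} of the \texttt{hep\_ml} package is used to implement the gradient boosted reweighting, the two stages with the configuration of Table~\ref{tab:gbreweightconfig} could be wired up as sketched below; the dataframes, column lists and weight variables are placeholders, not the actual analysis code:

```python
# Sketch only: passing the table's hyper-parameters to hep_ml's GBReweighter.
# mc_mm/data_mm, mc_ee/data_ee, stage1_vars/stage2_vars, 'sWeight' and
# w1_ee (first-stage weights propagated to the electron sample) are
# hypothetical names.
from hep_ml.reweight import GBReweighter

stage1 = GBReweighter(n_estimators=240, learning_rate=0.05, max_depth=3,
                      min_samples_leaf=100, loss_regularization=8.0,
                      gb_args={'subsample': 0.8})
stage2 = GBReweighter(n_estimators=140, learning_rate=0.05, max_depth=3,
                      min_samples_leaf=100, loss_regularization=10.0,
                      gb_args={'subsample': 0.8})

# stage 1: occupancy and B kinematics on the muonic normalisation mode
stage1.fit(original=mc_mm[stage1_vars], target=data_mm[stage1_vars],
           target_weight=data_mm['sWeight'])
w1 = stage1.predict_weights(mc_mm[stage1_vars])

# stage 2: decay-product kinematics on the electronic mode, keeping the
# first-stage weights in play
stage2.fit(original=mc_ee[stage2_vars], target=data_ee[stage2_vars],
           original_weight=w1_ee, target_weight=data_ee['sWeight'])
w2 = w1_ee * stage2.predict_weights(mc_ee[stage2_vars], w1_ee)
```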

\begin{figure}[tb]
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage1/b_endvertex_chi2.pdf}
	\end{subfigure}
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage1/b_pt.pdf}
	\end{subfigure}
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage1/nspdhits.pdf}
	\end{subfigure}
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage1/ntracks.pdf}
	\end{subfigure}

	\caption[stage 1 reweighting variables]{Variables used in the first stage of the reweighting procedure.}
	\label{fig:reweighting:firstvariables}
\end{figure}


\begin{table}[h]
	\caption{
		%\small %captions should be a little bit smaller than main text
		First stage reweighting variables in \Btojpsikpipimumu}
	\begin{center}\begin{tabular}{l | l}
			Variable	& Explanation \\
			
			\hline
			
			nTracks 	& Track multiplicity of the event. \\
			
%			\hline
			nSPDhits	& Number of hits in the scintillation pad detector. \\
			
%			\hline
			
			\B \pt		& \ptexpl\ of the \B. \\
			
			\B \chisqvtx & \chisqvtxexpl.
			
%			\hline
			
		\end{tabular}\end{center}
		\label{tab:firstreweight}
	\end{table}


\begin{table}[tb]
	\caption{
		%\small %captions should be a little bit smaller than main text
		Second stage reweighting variables in \Btojpsikpipiee}
	\begin{center}\begin{tabular}{l | l}
			Variable	& Explanation \\
			
			\hline
			
			min($\hadron_{\pt}$) 	& Minimum \pt of the hadronic decay products \\
			
			%			\hline
			max($\hadron_{\pt}$) & Maximum \pt of the hadronic decay products \\
			
			%			\hline
			
			min($\lepton_{\pt}$)		& Minimum \pt of the leptonic decay products \\
			
			max($\lepton_{\pt}$) & Maximum \pt of the leptonic decay products \\
			
			%			\hline
			
	\end{tabular}\end{center}
	\label{tab:secondreweight}
\end{table}

\begin{figure}[tb]
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage2/max_pt_hadrons.pdf}
	\end{subfigure}
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage2/max_pt_leptons.pdf}
	\end{subfigure}
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage2/min_pt_hadrons.pdf}
	\end{subfigure}
	\begin{subfigure}{0.5\linewidth}
		\includegraphics[width=\linewidth]{figs/reweighting/stage2/min_pt_leptons.pdf}
	\end{subfigure}
	
	\caption[stage 2 reweighting variables]{Variables used in the second stage of the reweighting procedure.}
	\label{fig:reweighting:secondvariables}
\end{figure}